33.2 Deploying the LiteLLM Gateway


33.2.1 Introduction to LiteLLM

LiteLLM is an open-source LLM gateway that supports 100+ LLM providers, including Anthropic, OpenAI, Cohere, and others. It exposes a single unified API, which simplifies using and managing multiple providers.

Core Features of LiteLLM

  1. Multi-provider support: works with 100+ LLM providers
  2. Unified API: one consistent interface that simplifies integration
  3. Smart caching: built-in caching to reduce cost and latency
  4. Rate limiting: configurable limits to control usage
  5. Cost tracking: detailed usage and cost analytics
  6. Load balancing: distributes requests across multiple API keys
  7. Retry on failure: automatically retries failed requests
  8. Streaming: supports streamed output
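Because the proxy speaks an OpenAI-compatible dialect, the "unified API" point above reduces to a plain HTTP POST from the client's perspective. The sketch below only assembles such a request; the base URL, key, and model name are placeholders for illustration, not values mandated by LiteLLM.

```python
# Minimal sketch of a chat request against a LiteLLM proxy.
# The URL, API key, and model name are hypothetical placeholders.

def build_chat_request(base_url: str, api_key: str, model: str, prompt: str) -> dict:
    """Assemble the parts of an OpenAI-compatible chat completion call."""
    return {
        "url": f"{base_url}/v1/chat/completions",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "json": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 256,
        },
    }

req = build_chat_request("http://localhost:4000", "sk-team-a-key-123",
                         "claude-sonnet-4", "Hello")
# To actually send it: requests.post(req["url"], headers=req["headers"], json=req["json"])
print(req["url"])  # http://localhost:4000/v1/chat/completions
```

Switching providers then means changing only the `model` string, while the request shape stays the same.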

LiteLLM Architecture

```
┌─────────────────────────────────────────┐
│           Claude Code client            │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│              LiteLLM Proxy              │
│   ┌──────────────────────────────┐      │
│   │  API layer                   │      │
│   │  (Anthropic, OpenAI, ...)    │      │
│   └──────────────────────────────┘      │
│   ┌──────────────────────────────┐      │
│   │  Cache layer                 │      │
│   │  (Redis, Memcached)          │      │
│   └──────────────────────────────┘      │
│   ┌──────────────────────────────┐      │
│   │  Monitoring layer            │      │
│   │  (Prometheus, Grafana)       │      │
│   └──────────────────────────────┘      │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│             LLM providers               │
│    (Anthropic, OpenAI, Cohere, ...)     │
└─────────────────────────────────────────┘
```

33.2.2 Installation and Configuration

1. Install LiteLLM

Install with Docker (recommended)

```bash
# Pull the LiteLLM image
docker pull litellm/litellm:latest

# Create a config directory
mkdir -p ~/litellm/config
cd ~/litellm

# Create the config file
cat > config.yaml << EOF
model_list:
  - model_name: claude-sonnet-4
    litellm_params:
      model: claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: claude-opus-4
    litellm_params:
      model: claude-opus-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: claude-haiku-4
    litellm_params:
      model: claude-haiku-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  drop_params: true
  set_verbose: true

general_settings:
  master_key: sk-litellm-master-key-123456
  database_url: postgresql://user:password@localhost:5432/litellm

security_settings:
  valid_api_keys:
    - sk-team-a-key-123
    - sk-team-b-key-456
EOF

# Start LiteLLM
docker run -d \
  --name litellm \
  -p 4000:4000 \
  -v $(pwd)/config.yaml:/app/config.yaml \
  -e ANTHROPIC_API_KEY=sk-ant-xxx \
  litellm/litellm:latest
```

Install with Python

```bash
# Install LiteLLM with the proxy extras
pip install 'litellm[proxy]'

# Initialize a configuration
litellm init

# Edit the config file
nano litellm_config.yaml

# Start the proxy server
litellm proxy --config litellm_config.yaml --port 4000
```

2. Configuration File Explained

```yaml
# litellm_config.yaml

# Model list
model_list:
  # Anthropic Claude models
  - model_name: claude-sonnet-4
    litellm_params:
      model: claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      api_base: https://api.anthropic.com
      max_tokens: 4096
      temperature: 0.7
  - model_name: claude-opus-4
    litellm_params:
      model: claude-opus-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      max_tokens: 4096
  - model_name: claude-haiku-4
    litellm_params:
      model: claude-haiku-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      max_tokens: 4096

  # Amazon Bedrock models
  - model_name: bedrock-claude-sonnet
    litellm_params:
      model: anthropic.claude-sonnet-4-5-20250929-v1:0
      api_base: https://bedrock-runtime.us-east-1.amazonaws.com
      api_key: os.environ/AWS_ACCESS_KEY_ID
      aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY
      aws_region_name: us-east-1

  # Google Vertex AI models
  - model_name: vertex-claude-sonnet
    litellm_params:
      model: claude-sonnet-4-5@20250929
      api_base: https://us-central1-aiplatform.googleapis.com
      api_key: os.environ/GOOGLE_APPLICATION_CREDENTIALS
      vertex_project: os.environ/VERTEX_PROJECT_ID
      vertex_location: us-central1

# LiteLLM settings
litellm_settings:
  drop_params: true   # drop unsupported parameters
  set_verbose: true   # verbose logging
  json_logs: true     # JSON-formatted logs
  success_callback: http://localhost:5000/callback  # success callback
  failure_callback: http://localhost:5000/failure   # failure callback

# General settings
general_settings:
  master_key: sk-litellm-master-key-123456                         # master key
  database_url: postgresql://user:password@localhost:5432/litellm  # database URL
  cache: redis://localhost:6379                                    # Redis cache
  cache_seconds: 3600                                              # cache TTL (seconds)

# Security settings
security_settings:
  valid_api_keys:           # valid API keys
    - sk-team-a-key-123
    - sk-team-b-key-456
    - sk-team-c-key-789
  max_budget: 1000.0        # maximum budget (USD)
  budget_duration: monthly  # budget period
  rpm_limit: 100            # requests per minute
  tpm_limit: 10000          # tokens per minute

# Load-balancing settings
load_balancing_settings:
  routing_strategy: usage-based  # usage-based, round-robin, least-latency
  health_check: true             # enable health checks
  health_check_interval: 60      # health-check interval (seconds)

# Monitoring settings
monitoring_settings:
  enable_prometheus: true   # enable Prometheus
  prometheus_port: 9090     # Prometheus port
  enable_slack_alerts: true # enable Slack alerts
  slack_webhook_url: https://hooks.slack.com/services/xxx/yyy/zzz
  alert_thresholds:
    error_rate: 0.05        # error-rate threshold
    latency_p99: 5000       # P99 latency threshold (ms)
```

33.2.3 Advanced Configuration

1. Cache Configuration

```yaml
# Cache settings
cache_settings:
  type: redis                     # cache type: redis, memory, none
  redis_url: redis://localhost:6379/0
  cache_ttl: 3600                 # cache time-to-live (seconds)
  cache_key_prefix: litellm       # cache key prefix
  enable_cache_for_stream: false  # cache streamed responses?
  cache_control_headers: true     # honor cache-control headers
```
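The idea behind gateway-side caching is that identical requests hash to the same key, so repeats are served without calling the upstream provider. The toy in-memory version below illustrates that mechanic only; it is not LiteLLM's actual implementation, and the `litellm` key prefix is just the configured default from above.

```python
import hashlib
import json

# Toy in-memory response cache, sketching the technique a gateway uses.
# NOT LiteLLM's implementation -- just the general idea.
_cache: dict = {}

def cache_key(model: str, messages: list, prefix: str = "litellm") -> str:
    """Derive a deterministic cache key from the request body."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return f"{prefix}:{hashlib.sha256(payload.encode()).hexdigest()}"

def cached_completion(model, messages, call_llm):
    """Return (response, was_cache_hit); call upstream only on a miss."""
    key = cache_key(model, messages)
    if key in _cache:
        return _cache[key], True          # cache hit: no upstream call
    response = call_llm(model, messages)  # cache miss: call the provider
    _cache[key] = response
    return response, False

msgs = [{"role": "user", "content": "Hello"}]
_, hit1 = cached_completion("claude-sonnet-4", msgs, lambda m, ms: "Hi!")
_, hit2 = cached_completion("claude-sonnet-4", msgs, lambda m, ms: "Hi!")
print(hit1, hit2)  # False True
```

A real deployment would use Redis (as configured above) instead of a process-local dict, and expire entries after `cache_ttl` seconds.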

2. Rate-Limit Configuration

```yaml
# Rate-limit settings
rate_limit_settings:
  enabled: true
  strategy: sliding_window  # sliding_window, token_bucket, fixed_window
  limits:
    - api_key: sk-team-a-key-123
      rpm: 100    # requests per minute
      tpm: 10000  # tokens per minute
      rpd: 10000  # requests per day
    - api_key: sk-team-b-key-456
      rpm: 50
      tpm: 5000
      rpd: 5000
  default_limits:
    rpm: 10
    tpm: 1000
    rpd: 100
  burst_size: 20  # burst size
```

3. Budget Control Configuration

```yaml
# Budget settings
budget_settings:
  enabled: true
  currency: USD
  budgets:
    - name: team-a-budget
      api_keys:
        - sk-team-a-key-123
      limit: 1000.0
      period: monthly
      alert_threshold: 0.8  # alert at 80% of the budget
      hard_limit: true      # block requests once the limit is reached
    - name: team-b-budget
      api_keys:
        - sk-team-b-key-456
      limit: 500.0
      period: monthly
      alert_threshold: 0.9
      hard_limit: false
  cost_tracking:
    enabled: true
    update_interval: 60  # update interval (seconds)
    storage: database    # database, file
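When a key exceeds its rpm/tpm limits, a rate-limiting gateway typically answers with HTTP 429, so clients should back off before retrying. A minimal capped exponential-backoff schedule looks like this; the base and cap values are illustrative choices, not settings from the config above.

```python
def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0) -> list:
    """Exponential backoff delays in seconds: base * 2^i, capped at `cap`."""
    return [min(base * (2 ** i), cap) for i in range(attempts)]

# A client loop would sleep backoff_delays(n)[i] seconds after the i-th
# 429 response before retrying the request.
print(backoff_delays(6))  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

Adding random jitter to each delay is a common refinement that keeps many throttled clients from retrying in lockstep.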

4. Monitoring and Alerting Configuration

```yaml
# Monitoring settings
monitoring_settings:
  prometheus:
    enabled: true
    port: 9090
    metrics:
      - request_count
      - request_duration
      - error_count
      - cache_hit_rate
      - token_usage
      - cost
  grafana:
    enabled: true
    dashboard_url: http://localhost:3000/d/litellm
  alerts:
    slack:
      enabled: true
      webhook_url: https://hooks.slack.com/services/xxx/yyy/zzz
      channels:
        - litellm-alerts
        - devops-notifications
      alert_rules:
        - name: high_error_rate
          condition: error_rate > 0.05
          duration: 5m
          severity: warning
        - name: high_latency
          condition: p99_latency > 5000
          duration: 2m
          severity: critical
        - name: budget_exceeded
          condition: budget_usage > 1.0
          severity: critical
    email:
      enabled: true
      smtp_server: smtp.gmail.com
      smtp_port: 587
      smtp_username: alerts@company.com
      smtp_password: ${SMTP_PASSWORD}
      from_address: litellm-alerts@company.com
      to_addresses:
        - devops@company.com
        - finance@company.com
```

33.2.4 Integrating with Claude Code

1. Configure Claude Code to Use LiteLLM

```bash
# Method 1: unified endpoint (recommended)
export ANTHROPIC_BASE_URL=https://litellm-server:4000
export ANTHROPIC_AUTH_TOKEN=sk-litellm-static-key

# Method 2: Anthropic-format endpoint
export ANTHROPIC_BASE_URL=https://litellm-server:4000/anthropic
export ANTHROPIC_AUTH_TOKEN=sk-litellm-static-key
```

Method 3: use an API key helper

```bash
# Create the helper script
cat > ~/bin/get-litellm-key.sh << 'EOF'
#!/bin/bash
# Fetch the key from Vault
vault kv get -field=api_key secret/litellm/claude-code
EOF
chmod +x ~/bin/get-litellm-key.sh
```

```bash
# Configure Claude Code to use the helper
cat > ~/.claude-code/settings.json << EOF
{
  "apiKeyHelper": "~/bin/get-litellm-key.sh",
  "env": {
    "ANTHROPIC_BASE_URL": "https://litellm-server:4000"
  }
}
EOF
```

2. Verify the Configuration

```python
import requests
from dataclasses import dataclass, field

# Minimal result/report containers (assumed shapes, filled in by the validator)
@dataclass
class ValidationResult:
    success: bool = False
    message: str = ""
    error: str = ""

@dataclass
class ValidationReport:
    connection: ValidationResult = field(default_factory=ValidationResult)
    models: dict = field(default_factory=dict)
    summary: str = ""

class LiteLLMValidator:
    """LiteLLM validator"""

    def __init__(self, gateway_url: str, auth_token: str):
        self.gateway_url = gateway_url
        self.auth_token = auth_token

    def validate_connection(self) -> ValidationResult:
        """Validate connectivity to the gateway"""
        result = ValidationResult()

        try:
            # Probe the health-check endpoint
            response = requests.get(
                f"{self.gateway_url}/health",
                headers={'Authorization': f'Bearer {self.auth_token}'},
                timeout=10
            )

            if response.status_code == 200:
                result.success = True
                result.message = "Connection successful"
            else:
                result.success = False
                result.message = f"Health check failed: {response.status_code}"

        except requests.exceptions.Timeout:
            result.success = False
            result.message = "Connection timeout"
        except requests.exceptions.ConnectionError:
            result.success = False
            result.message = "Connection error"
        except Exception as e:
            result.success = False
            result.message = f"Unexpected error: {str(e)}"

        return result

    def validate_model_access(self, model: str) -> ValidationResult:
        """Validate access to a model"""
        result = ValidationResult()

        try:
            # Issue a tiny test completion
            response = requests.post(
                f"{self.gateway_url}/v1/completions",
                headers={
                    'Authorization': f'Bearer {self.auth_token}',
                    'Content-Type': 'application/json'
                },
                json={
                    'model': model,
                    'prompt': 'Hello',
                    'max_tokens': 10
                },
                timeout=30
            )

            if response.status_code == 200:
                result.success = True
                result.message = f"Model {model} accessible"
            else:
                result.success = False
                result.message = f"Model access failed: {response.status_code}"
                result.error = response.text

        except Exception as e:
            result.success = False
            result.message = f"Model access error: {str(e)}"

        return result

    def validate_all(self) -> ValidationReport:
        """Validate the full configuration"""
        report = ValidationReport()

        # Validate connectivity
        report.connection = self.validate_connection()

        # Validate model access
        models = ['claude-sonnet-4', 'claude-opus-4', 'claude-haiku-4']
        report.models = {}

        for model in models:
            report.models[model] = self.validate_model_access(model)

        # Build the summary
        report.summary = self._generate_summary(report)

        return report

    def _generate_summary(self, report: ValidationReport) -> str:
        """Build the validation summary"""
        summary = "LiteLLM Validation Summary:\n\n"

        summary += f"Connection: {'✓' if report.connection.success else '✗'} "
        summary += f"{report.connection.message}\n\n"

        summary += "Model Access:\n"
        for model, result in report.models.items():
            status = '✓' if result.success else '✗'
            summary += f"  {status} {model}: {result.message}\n"

        return summary
```

33.2.5 Monitoring and Management

1. Prometheus Monitoring

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'litellm'
    static_configs:
      - targets: ['litellm-server:9090']
    metrics_path: '/metrics'
```
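Prometheus's `rate()` over a counter is just the delta between two scrapes divided by the elapsed time, and an error-rate expression divides two such rates. The sketch below reproduces that arithmetic on two hypothetical scrape samples; the field names and numbers are made up for illustration.

```python
def counter_rate(prev: float, curr: float, seconds: float) -> float:
    """Per-second rate of a monotonically increasing counter between two samples."""
    return (curr - prev) / seconds

def error_rate(prev: dict, curr: dict, seconds: float = 60.0) -> float:
    """Approximates rate(error_count) / rate(request_count) over one window."""
    req_rate = counter_rate(prev["requests"], curr["requests"], seconds)
    err_rate = counter_rate(prev["errors"], curr["errors"], seconds)
    return err_rate / req_rate if req_rate > 0 else 0.0

# Two hypothetical scrapes taken 60 s apart:
prev = {"requests": 1000, "errors": 20}
curr = {"requests": 1600, "errors": 50}
print(error_rate(prev, curr))  # 0.05 -- right at the alert threshold above
```

This is the same quantity the `high_error_rate` alert rule and the Grafana "Error Rate" panel compute server-side.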

2. Grafana Dashboard

```json
{
  "dashboard": {
    "title": "LiteLLM Dashboard",
    "panels": [
      { "title": "Request Rate",
        "targets": [{ "expr": "rate(litellm_request_count[1m])" }] },
      { "title": "Error Rate",
        "targets": [{ "expr": "rate(litellm_error_count[1m]) / rate(litellm_request_count[1m])" }] },
      { "title": "P99 Latency",
        "targets": [{ "expr": "histogram_quantile(0.99, rate(litellm_request_duration_bucket[1m]))" }] },
      { "title": "Cache Hit Rate",
        "targets": [{ "expr": "rate(litellm_cache_hits[1m]) / rate(litellm_cache_requests[1m])" }] },
      { "title": "Token Usage",
        "targets": [{ "expr": "rate(litellm_token_usage[1m])" }] },
      { "title": "Cost",
        "targets": [{ "expr": "litellm_cost_total" }] }
    ]
  }
}
```

3. Log Management

```python
class LiteLLMLogManager:
    """LiteLLM log manager"""

    def __init__(self, log_file: str):
        self.log_file = log_file
        self.log_parser = LiteLLMLogParser()

    def analyze_logs(self, start_time: datetime = None,
                     end_time: datetime = None) -> LogAnalysis:
        """Analyze the log file"""
        analysis = LogAnalysis()

        # Read the log file
        with open(self.log_file, 'r') as f:
            logs = f.readlines()

        # Parse each line
        parsed_logs = []
        for log in logs:
            try:
                parsed = self.log_parser.parse(log)
                parsed_logs.append(parsed)
            except Exception as e:
                logger.warning(f"Failed to parse log: {e}")

        # Filter by time range
        if start_time or end_time:
            parsed_logs = [
                log for log in parsed_logs
                if (not start_time or log.timestamp >= start_time)
                and (not end_time or log.timestamp <= end_time)
            ]

        # Request counts
        analysis.total_requests = len(parsed_logs)
        analysis.successful_requests = sum(
            1 for log in parsed_logs if log.status == 'success'
        )
        analysis.failed_requests = sum(
            1 for log in parsed_logs if log.status == 'error'
        )
        analysis.error_rate = (
            analysis.failed_requests / analysis.total_requests
            if analysis.total_requests > 0 else 0
        )

        # Latency
        latencies = [log.duration for log in parsed_logs if log.duration]
        if latencies:
            analysis.avg_latency = sum(latencies) / len(latencies)
            analysis.p50_latency = np.percentile(latencies, 50)
            analysis.p95_latency = np.percentile(latencies, 95)
            analysis.p99_latency = np.percentile(latencies, 99)

        # Token usage
        analysis.total_tokens = sum(
            log.input_tokens + log.output_tokens for log in parsed_logs
        )

        # Cost
        analysis.total_cost = sum(log.cost for log in parsed_logs)

        return analysis

    def generate_report(self, analysis: LogAnalysis) -> str:
        """Render an analysis report"""
        report = "LiteLLM Log Analysis Report\n"
        report += "=" * 50 + "\n\n"

        report += "Request Summary:\n"
        report += f"  Total: {analysis.total_requests}\n"
        report += f"  Successful: {analysis.successful_requests}\n"
        report += f"  Failed: {analysis.failed_requests}\n"
        report += f"  Error Rate: {analysis.error_rate:.2%}\n\n"

        report += "Latency (ms):\n"
        report += f"  Average: {analysis.avg_latency:.0f}\n"
        report += f"  P50: {analysis.p50_latency:.0f}\n"
        report += f"  P95: {analysis.p95_latency:.0f}\n"
        report += f"  P99: {analysis.p99_latency:.0f}\n\n"

        report += "Token Usage:\n"
        report += f"  Total: {analysis.total_tokens:,}\n\n"

        report += "Cost:\n"
        report += f"  Total: ${analysis.total_cost:.2f}\n"

        return report
```
